Skip to content

Comments

[SPARK-28878][SQL] Remove extra project for DSv2 reads with columnar batches#25586

Closed
rdblue wants to merge 1 commit intoapache:masterfrom
rdblue:SPARK-28878-remove-dsv2-project-with-columnar
Closed

[SPARK-28878][SQL] Remove extra project for DSv2 reads with columnar batches#25586
rdblue wants to merge 1 commit intoapache:masterfrom
rdblue:SPARK-28878-remove-dsv2-project-with-columnar

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Aug 26, 2019

What changes were proposed in this pull request?

Remove unnecessary physical projection added to ensure rows are UnsafeRow when the DSv2 scan is columnar. This is not needed because conversions are automatically added to convert from columnar operators to UnsafeRow when the next operator does not support columnar execution.

Why are the changes needed?

Removes an extra projection and copy.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

@SparkQA
Copy link

SparkQA commented Aug 26, 2019

Test build #109757 has finished for PR 25586 at commit 3fbb5dd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Copy link
Contributor Author

rdblue commented Aug 27, 2019

Retest this please.

@SparkQA
Copy link

SparkQA commented Aug 27, 2019

Test build #109828 has finished for PR 25586 at commit 3fbb5dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue changed the title [SPARK-28878[SQL][WIP] Remove extra project for DSv2 reads with columnar batches. [SPARK-28878][SQL][WIP] Remove extra project for DSv2 reads with columnar batches. Sep 3, 2019
@prodeezy
Copy link

prodeezy commented Sep 4, 2019

@rdblue This looks very useful and pertinent to Vectorization support use-case in DSV2 and should help in further improving performance. Thanks for working on this!
Is this mostly pending unit tests/ docs for the "WIP" to come off or is there more work to be one on this optimization?

@rdblue rdblue changed the title [SPARK-28878][SQL][WIP] Remove extra project for DSv2 reads with columnar batches. [SPARK-28878][SQL] Remove extra project for DSv2 reads with columnar batches Sep 4, 2019
@rdblue
Copy link
Contributor Author

rdblue commented Sep 4, 2019

@prodeezy, I just needed to verify that there are tests for the columnar path and there are: JavaColumnarDataSourceV2 is used in DataSourceV2Suite.

@rdblue
Copy link
Contributor Author

rdblue commented Sep 4, 2019

@brkyvz, can you take a look at this?

@rdblue rdblue requested a review from cloud-fan September 4, 2019 21:38
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in dde3931 Sep 5, 2019

// always add the projection, which will produce unsafe rows required by some operators
ProjectExec(project, withFilter) :: Nil
val withProjection = if (withFilter.output != project || !batchExec.supportsColumnar) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, shall we do the same for streaming scan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not as familiar with the streaming side. If we support columnar reads there, we probably can.

@rdblue
Copy link
Contributor Author

rdblue commented Sep 5, 2019

Thanks for merging this, @cloud-fan!

cloud-fan added a commit that referenced this pull request Sep 10, 2019
…scan

### What changes were proposed in this pull request?

Remove the project node if the streaming scan is columnar

### Why are the changes needed?

This is a followup of #25586. Batch and streaming share the same DS v2 read API so both can support columnar reads. We should apply #25586 to streaming scan as well.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #25727 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
…batches

### What changes were proposed in this pull request?

Remove unnecessary physical projection added to ensure rows are `UnsafeRow` when the DSv2 scan is columnar. This is not needed because conversions are automatically added to convert from columnar operators to `UnsafeRow` when the next operator does not support columnar execution.

### Why are the changes needed?

Removes an extra projection and copy.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes apache#25586 from rdblue/SPARK-28878-remove-dsv2-project-with-columnar.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
…scan

### What changes were proposed in this pull request?

Remove the project node if the streaming scan is columnar

### Why are the changes needed?

This is a followup of apache#25586. Batch and streaming share the same DS v2 read API so both can support columnar reads. We should apply apache#25586 to streaming scan as well.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes apache#25727 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants